Dataset Creation and Anonymization Process

This guide outlines the steps for creating and anonymizing datasets within AI Studio. It covers uploading data, selecting tables and columns, applying filters, anonymizing sensitive information, and exporting the final dataset.


Steps to Create a Dataset Using Integration

Dataset creation involves five steps. Each one is described below:

Step 1: Configuration

  • Upload a Dataset File: Upload a new dataset or choose an existing one from a connected database. You have two options:
    • File Upload: Click the designated upload area, or drag your dataset files into it.
    • Database Integration: Select a pre-configured database (such as MySQL) to pull data from it directly.

Tip: You can either upload new files or choose datasets that are already available within your workspace.
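
The two options above can be pictured with a small local sketch. It is not AI Studio's API: the helper names `rows_from_csv` and `rows_from_db` are hypothetical, an inline string stands in for an uploaded file, and an in-memory SQLite database stands in for a pre-configured MySQL integration. Both paths end in the same shape, a list of row dictionaries.

```python
import csv
import io
import sqlite3


def rows_from_csv(text):
    """Parse CSV text into a list of dicts, one per row (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_db(conn, table):
    """Read every row of a table into a list of dicts keyed by column name."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Option 1: file upload (hypothetical sample data standing in for the uploaded file).
uploaded = rows_from_csv(
    "id,name,email\n1,Ada,ada@example.com\n2,Linus,linus@example.com\n"
)

# Option 2: database integration (in-memory SQLite standing in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", "ada@example.com"), (2, "Linus", "linus@example.com")],
)
from_db = rows_from_db(conn, "users")
```

Note that the CSV path yields every value as a string, while the database path preserves column types; a real pipeline would likely need to reconcile the two.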


Step 2: Select Tables and Columns

  • Choose Tables and Columns: After uploading or selecting your dataset, pick the tables and columns you want to include.
    • You can select individual tables or include all tables and columns.
    • Refine your dataset by picking only the columns that are relevant to your task.

Tip: Selecting only the necessary columns produces a more focused dataset.
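
Column selection amounts to projecting each row onto a subset of its fields. The sketch below illustrates the idea on in-memory rows; `select_columns` and the sample data are hypothetical and do not reflect how AI Studio implements this step.

```python
def select_columns(rows, columns):
    """Return new rows containing only the requested columns, in the order given."""
    missing = [c for c in columns if rows and c not in rows[0]]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical sample rows.
rows = [
    {"id": 1, "name": "Ada", "email": "ada@example.com", "country": "UK"},
    {"id": 2, "name": "Linus", "email": "linus@example.com", "country": "FI"},
]

# Keep only the columns relevant to the task.
focused = select_columns(rows, ["id", "country"])
```

Failing loudly on an unknown column name is a design choice made here so that a typo does not silently produce an empty column.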


Step 3: Apply Filters

  • Apply Filters: To further refine your dataset, apply filters to the selected columns.
    • You can add multiple filters based on specific criteria to extract the most relevant data.

Tip: Filters restrict the dataset to the data that is most useful for your analysis and leave out unnecessary information.
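
One way to think about multiple filters is as a list of (column, operator, value) conditions that a row must satisfy. The sketch below assumes the filters are combined with a logical AND, which the guide does not state; `apply_filters`, the `OPS` table, and the sample rows are hypothetical.

```python
import operator

# Supported comparison operators (an illustrative subset).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def apply_filters(rows, filters):
    """Keep rows that satisfy every (column, op, value) filter (assumed AND semantics)."""
    return [
        row
        for row in rows
        if all(OPS[op](row[col], value) for col, op, value in filters)
    ]


# Hypothetical sample rows.
rows = [
    {"id": 1, "country": "UK", "age": 36},
    {"id": 2, "country": "FI", "age": 28},
    {"id": 3, "country": "UK", "age": 17},
]

# Two filters on two columns: only row 1 satisfies both.
adults_uk = apply_filters(rows, [("country", "==", "UK"), ("age", ">=", 18)])
```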


Step 4: Data Anonymization

  • Anonymize Sensitive Data: In this step, you apply anonymization methods to sensitive data fields to protect privacy.
    • Set up transformers to anonymize personal information and other sensitive fields.
    • Choose the anonymization approach that matches the privacy needs of your data.

Tip: Anonymization helps protect sensitive data while keeping it usable for analysis and other purposes.
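
A transformer can be pictured as a function applied to every value in one column. The sketch below shows three common approaches: keyed hashing, masking, and redaction. All names (`pseudonymize`, `mask_email`, `redact`, `anonymize`, `SECRET_KEY`) are hypothetical, and the transformers that AI Studio actually offers may differ. Keyed hashing is strictly pseudonymization: the same input always maps to the same token, so records stay linkable.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key; keep real keys out of source code


def pseudonymize(value):
    """Replace a value with a truncated keyed SHA-256 digest (stable for equal inputs)."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_email(value):
    """Hide the local part of an email address but keep the domain."""
    local, sep, domain = value.partition("@")
    if not sep:
        return "*" * len(value)
    return "*" * len(local) + "@" + domain


def redact(_value):
    """Drop the value entirely."""
    return "[REDACTED]"


def anonymize(rows, transformers):
    """Apply the per-column transformer to each row; other columns pass through unchanged."""
    return [
        {col: transformers.get(col, lambda v: v)(val) for col, val in row.items()}
        for row in rows
    ]


# Hypothetical sample rows and transformer configuration.
rows = [{"id": 1, "name": "Ada", "email": "ada@example.com", "country": "UK"}]
safe = anonymize(rows, {"id": pseudonymize, "name": redact, "email": mask_email})
```

Which approach fits depends on the privacy need: redaction removes the most information, masking keeps part of the structure, and keyed hashing preserves the ability to join records.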


Step 5: Export the Dataset

  • Export the Dataset: In the final step, you will export your anonymized dataset in the format you prefer:
    • JSON
    • CSV
    • JSONL

You can also select the export type: a table or a downloadable dataset file. Once you have chosen the type and format, generate the final dataset based on your configured settings.

Tip: Choose the export format that best suits your needs for future use or integration with other systems.
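
To show how the three formats differ, the sketch below serializes the same rows as JSON, CSV, and JSONL using only the Python standard library. `export_rows` and the sample rows are hypothetical; this illustrates the formats and is not the export mechanism AI Studio uses.

```python
import csv
import io
import json


def export_rows(rows, fmt):
    """Serialize a list of row dicts as 'json', 'csv', or 'jsonl' text."""
    if fmt == "json":
        # One JSON array holding every row.
        return json.dumps(rows, indent=2)
    if fmt == "jsonl":
        # One JSON object per line.
        return "".join(json.dumps(row) + "\n" for row in rows)
    if fmt == "csv":
        if not rows:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")


# Hypothetical sample rows.
rows = [{"id": 1, "country": "UK"}, {"id": 2, "country": "FI"}]

as_json = export_rows(rows, "json")
as_csv = export_rows(rows, "csv")
as_jsonl = export_rows(rows, "jsonl")
```

JSONL is often convenient for streaming or line-by-line processing, CSV for spreadsheets and tabular tools, and JSON when the consumer expects a single document.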


Generating Your Dataset

After completing all five steps, you can generate your customized and anonymized dataset in your chosen format (JSON, CSV, or JSONL), ready for use in your AI projects.